A distributable German clinical corpus containing cardiovascular clinical routine doctor’s letters

We present CARDIO:DE, the first freely available and distributable large German clinical corpus from the cardiovascular domain. CARDIO:DE encompasses 500 clinical routine German doctor’s letters from Heidelberg University Hospital, which were manually annotated. Our prospective study design complies well with current data protection regulations and allows us to keep the original structure of clinical documents consistent. To ease access to our corpus, we manually de-identified all letters. To enable various information extraction tasks, the temporal information in the documents was preserved. We added two high-quality manual annotation layers to CARDIO:DE: (1) medication information and (2) CDA-compliant section classes. To the best of our knowledge, CARDIO:DE is the first freely available and distributable German clinical corpus in the cardiovascular domain. In summary, our corpus offers unique opportunities for collaborative and reproducible research on natural language processing models for German clinical texts.


Preliminaries
Medication classes are annotated with the annotation tool INCEpTION. This project uses an iterative guideline adaptation process based on inter-annotator agreement to ensure high annotation quality (Roberts et al. 2007). To make changes to these guidelines easy to track, a history section is provided in Appendix 1. Do not make assumptions or consider longitudinal information you may know about a patient in this workflow.

Classes to annotate medication information
For each medication listed in a discharge letter, the following information should be annotated, if available:

Description of classes
The annotation objective is to identify a relevant drug or active ingredient and its relation attributes (dosage, route, frequency, duration, strength and form). In addition, we annotate the DRUG/ACTIVEING class with the attribute TYPE if the medication information is described in a narrative, non-semi-structured section. In this section we describe each medication information class, including an example.
f) We also include brackets in an annotated entity: Ass <ACTIVEING> -LRB-Acetylsalicylsäure -RRB-<ACTIVEING>.
g) If a DRUG is given with more than one generic name specified in brackets, we create one ACTIVEING annotation for all generic names: Amiloretik <DRUG> -LRB-Amiloridhydrochlorid, Hydrochlorothiazid -RRB-<ACTIVEING>.
• specified time of day or hours
• bei Bedarf, max 3x tägl.

REASON
a) The medical reason for which the medication is stated to be given. Indications for which the medication would normally be given, but which the text does not assert as the reason for administering it, are not included. Reasons/indications are usually given by adjective phrases or noun phrases. They usually correspond to diseases, signs or symptoms, and information related to other medications. The reason must be in the same sentence as ACTIVEING/DRUG, or at most in the preceding or succeeding sentence or list item. We include adjectives in annotations: primärprophylaktischen Plaquestabilisierung. A therapy, e.g. antibiotische Therapie, is annotated as a REASON if it is further described by ACTIVEING. See Fig. 14.
• Flutiform 250 inh.
• Heparin Perfuso
• Inhalation mit Sultanol und Atrovent

The attributes inNarrative and isSuggested are only annotated for ACTIVEING and DRUG elements.
• inNarrative: the medication is located in a free-text section of a discharge letter (including Diagnosis). The medication is not part of a semi-structured section, e.g. Medikation, Therapieempfehlung, Medikation bei Aufnahme ...
• isSuggested: prescription of the medication is suggested, discussed or advised. This has to be mentioned explicitly in the text with terms like In Diskussion, wird in Erwägung gezogen or similar. isSuggested is not used in semi-structured sections.
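The isSuggested rule can be illustrated with a minimal sketch. It assumes a simple case-insensitive substring match on the two trigger phrases named above; the function name, the trigger tuple and the boolean flag for semi-structured sections are hypothetical and only serve as an illustration. The guidelines address human annotators, and phrases covered by "or similar" are not captured here.

```python
# Trigger phrases taken from the guideline text; the list is not exhaustive.
SUGGESTION_TRIGGERS = ("in diskussion", "wird in erwägung gezogen")


def is_suggested(sentence: str, in_semi_structured_section: bool) -> bool:
    """Return True if the sentence explicitly marks a medication as suggested."""
    # isSuggested is never used in semi-structured sections.
    if in_semi_structured_section:
        return False
    lowered = sentence.lower()
    return any(trigger in lowered for trigger in SUGGESTION_TRIGGERS)


if __name__ == "__main__":
    text = "Eine Therapie mit Marcumar wird in Erwägung gezogen."
    print(is_suggested(text, in_semi_structured_section=False))
```

The explicit section flag mirrors the rule that a suggestion is only annotated in narrative text, even if a trigger phrase happens to occur in a medication list.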

Special cases
• Articles: An indefinite article is not included in annotated entities. For example, in the noun phrase ein Antibiotikum, the span annotated is Antibiotikum not ein Antibiotikum.

• Prepositions: Avoid prepositions, e.g. bei, mit, except where they provide meaning or create a contiguous span. Do include them when prepositional phrases add meaning, such as in FREQUENCY and DURATION spans, e.g. pro Tag, eine Woche lang, bis 14. März.
• Punctuation: We include periods used with abbreviations in the annotation, e.g. 1 Tabl.
• If brackets are not paired, we do not annotate them, e.g. -LRB-Acetylsalicylsäure.
• UNK (unknown token): If it is clear from the context what UNK stands for, we create a separate annotation for it, e.g. UNK <DOSAGE> Tabl. <FORM>. If a UNK token directly follows a drug name, it is most probably a copyright symbol, and we include it in the DRUG annotation, e.g. MarcumarUNK <DRUG>, Kalium VerlaUNK <DRUG>.
• If FREQUENCY and STRENGTH, including units such as mg, ml or similar, are combined in one token, we choose STRENGTH, e.g. 2x5mg, 400mg-200mg-200mg.
• If a DRUG name includes a STRENGTH in the middle of the term, we annotate the whole sequence of tokens as DRUG, see
-Adding rules to annotate composita to section 5.
-Added positive examples for ROUTE, DRUG, ACTIVEING, FREQUENCY.
-Restricted context window for REASON to max. succeeding or preceding sentence.
-Included Sauerstoff as ACTIVEING.
• 05.05.2022:
-Added rules to annotate ACTIVEING in context of conjunctions and brackets.
-Split annotations of DRUG and ACTIVEING at conjunctions.
-Added annotation rules about context length of REASON.
-Included adjectives in REASON annotation.
-isSuggested only used if suggestion is explicitly mentioned, in section 4.
-Added DURATION examples in section 4.
-Added rule in section 5: if FREQUENCY and STRENGTH, including units such as mg, ml or similar, are combined in one token, we choose STRENGTH, e.g. 2x5mg.
-Added REASON and ACTIVEING examples in section 4.
-Split REASON and ROUTE rules in section 5.
-Added REASON examples in section 4.
-Added exception to not annotate allergies. Patient must be experiencer of medication, see section 5.
-Added DURATION example.
• 12.05.2022:
-Added Fig. 13 to visualize bracket annotations that contain more than one distinct class type.

-Added example for STRENGTH.
-Avoid prepositions in annotations, if they do not give medication related meaning, see section 5.
-Added rules to annotate therapies as REASON.
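The rule for tokens that combine FREQUENCY and STRENGTH can be sketched as follows. This is a minimal illustration, assuming that a number directly followed by a unit is enough to signal a strength; the regular expression, the restriction to mg and ml, the function name and the FREQUENCY fallback are hypothetical and not part of the guidelines.

```python
import re

# A number (optionally with decimal comma or point) directly followed by a unit.
# Only mg and ml are listed here; the guidelines also allow similar units.
UNIT_PATTERN = re.compile(r"\d+(?:[.,]\d+)?(?:mg|ml)\b", re.IGNORECASE)


def label_combined_token(token: str, default: str = "FREQUENCY") -> str:
    """Choose STRENGTH if the token contains a value with a unit, else the default."""
    return "STRENGTH" if UNIT_PATTERN.search(token) else default


if __name__ == "__main__":
    for token in ("2x5mg", "400mg-200mg-200mg", "1-0-1"):
        print(token, label_combined_token(token))
```

Both examples from the guidelines, 2x5mg and 400mg-200mg-200mg, contain a unit and are therefore labeled STRENGTH, while a pattern without a unit keeps the assumed default.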

Preliminaries
Section classes are annotated with the annotation tool INCEpTION. This project uses an iterative guideline adaptation process based on inter-annotator agreement to ensure high annotation quality (Roberts et al. 2007). To make changes to these guidelines easy to track, a history section is provided in section 7. Do not make assumptions or consider longitudinal information you may know about a patient in this workflow. We only annotate the first line of the section class. This can be a header, a single token or a sequence of tokens. For examples, see 1, 2 and 3.

Classes to annotate section information
This section contains a list of all section types used in this project.
• Abschluss -Description: Information about allergies, intolerances and cardiovascular risks. Typically initialized with Kardiovaskuläre Risikofaktoren, Cvrf, Allergien or similar. In our discharge letters, risk factors and allergies are typically two distinct sections.
• Zusammenfassung -CDA code: https://wiki.hl7.de/index.php?title=IG:Arztbrief_Plus#Zusammenfassung_des_Aufenthalts -Description: In the Epicrisis / Summary of stay section, a special summary review is recorded, an interpretation of the patient's events as well as the initiated therapy, which is intended for the physician providing further treatment. Typically initialized with Epikrise, Zusammenfassung, or similar.
• Mix -CDA code: X -Description: All content that does not fit into one of the above section classes. This excludes laboratory values, which we do not annotate. Can typically appear after risk factors, diagnosis, conclusion.

Special cases
We do not annotate laboratory values, as they are typically inside improperly converted tables in our text documents.

-Do not forget to switch section types, especially inside Befunde section.
-Echo section can appear inside Diagnosis section.
-Added example to Echo.
-We annotate the section introduced by the header Aktuelle Medikation as AufnahmeMedikation if it is located after anamnesis or risk factors.
Example snippets

Table S1 BERT - Classification report (token-wise including B- and I-substrings)
Table S2 BERT - Classification report (token-wise removing B- and I-substrings)
Table S3 BERT - Classification report (entity-wise, strict IOB)
Table S4 BERT - Hyperparameters
Table S5 CRF - Classification report (token-wise including B- and I-substrings)
Table S6 CRF - Classification report (token-wise removing B- and I-substrings)
Table S7 CRF - Classification report (entity-wise, strict IOB)
Table S8 CRF - Hyperparameters
Figure S1 CRF - Features

Section classification
Figure S2 BERT - Confusion matrix
Table S9 BERT - Features
Figure S3 SVM - Confusion matrix
Table S10 SVM - Hyperparameters and features

During class mapping, we removed the IOB format of all class labels. The beginning and end of an entity are based on different assumptions due to the different definitions of Clinical Drug and our labels; thus, this scheme produced various annotation errors.
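Removing the IOB format during class mapping can be illustrated with a minimal sketch. It assumes labels follow the common convention of a B- or I- prefix, with O for tokens outside any entity; the function name strip_iob is hypothetical and not taken from the original pipeline.

```python
def strip_iob(label: str) -> str:
    """Drop a leading B- or I- prefix; leave all other labels (e.g. O) unchanged."""
    if label.startswith(("B-", "I-")):
        return label[2:]
    return label


if __name__ == "__main__":
    labels = ["B-DRUG", "I-DRUG", "O", "B-STRENGTH"]
    print([strip_iob(label) for label in labels])
```

After this step, tokens are compared by class only, so disagreements about where an entity begins or ends no longer count as label mismatches.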

Evaluation
In our evaluations, we consistently renamed all mapped classes, including Clinical Drug, to DRUG. GGPONC NER was released in four versions. We show the results of the model that performed best on our data, 04_ggponc_fine_long (Tables 3-4). With the short mapping, precision was low while recall reached 67%, which indicates a large number of false positive predictions. The long mapping clearly improved both precision and recall. In-depth analysis confirmed our assumption that the Clinical Drug class of SNOMED CT also covers our Frequency and Strength classes.
We further investigated frequent false positive predictions of both models (Tables 5-7). Frequent false negatives of both models were ActiveIng mentions such as ASS, Clopidogrel or Vitamin D.
In addition, Drugs such as Panzytrat or Tromcardin were frequently not recognized as Drug.
Tables 8-9 show the results of the 02_ggponc_fine_short model. For details, see Borchert et al. (2022). We leave further investigations and experiments comparing both model types for future work.
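The relation between low precision and many false positives can be made concrete with a short sketch. It assumes a token-wise comparison of gold and predicted labels after all mapped classes have been renamed to DRUG; the function name and the toy label sequences are hypothetical and do not reproduce any result reported here.

```python
from typing import Sequence, Tuple


def precision_recall(
    gold: Sequence[str], pred: Sequence[str], target: str = "DRUG"
) -> Tuple[float, float]:
    """Token-wise precision and recall for a single target class."""
    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if g == target and p == target)
    fp = sum(1 for g, p in pairs if g != target and p == target)
    fn = sum(1 for g, p in pairs if g == target and p != target)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


if __name__ == "__main__":
    # Toy data: two false positives pull precision below recall.
    gold = ["DRUG", "O", "DRUG", "O"]
    pred = ["DRUG", "DRUG", "O", "DRUG"]
    print(precision_recall(gold, pred))
```

In the toy data, one true positive stands against two false positives and one false negative, so precision (1/3) falls below recall (1/2), the same pattern described for the short mapping.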